A good first step is to review the data that we will be working with. First we should know the names of the variables contained in our data, the shape they are currently in, and some basic summary statistics.
names(WineData)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
str(WineData)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(WineData)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
From the output we can see there are 1599 observations in the data across 13 variables, though one variable, ‘X’, is simply used as a unique identifier for our entries. The majority of our variables appear to be continuous in nature, with the exception of quality (and the rating variable derived from it later), which is discrete. This makes sense given that things like quality and rating are typically measured on something like a Likert scale. From the variable descriptions, it appears that the pairs fixed.acidity and volatile.acidity, and free.sulfur.dioxide and total.sulfur.dioxide, may be dependent, with one possibly a subset of the other.
The focus of this analysis is on the factors contributing to wine quality. Since we’re primarily interested in quality, we should take a closer look at what the summary and makeup of quality have shown so far.
Some initial observations here:
- From the literature, quality was measured on a 0-10 scale and was rated by at least 3 wine experts. The values ranged only from 3 to 8, with a mean of 5.6 and a median of 6.
- All other variables seem to be continuous quantities (with the exception of the .sulfur.dioxide suffixes).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
To first explore this data visually, I’ll draw up quick histograms of all 12 variables to get a better idea of the shape of our data. The intention here is to see a quick distribution of the values. Viewing histograms and boxplots together makes it easier to see the effect of outliers on the distribution.
## Warning: Ignoring unknown parameters: horizontal
- It appears that density and pH are normally distributed, with few outliers.
- Fixed and volatile acidity, sulfur dioxides, sulphates, and alcohol seem to be long-tailed.
- Residual sugar and chlorides appear to have outliers, though a histogram isn’t the best way to visualize this. Looking at the box plot helps to confirm our suspicions of outliers.
Only a few of the variables, density and pH, appear to be normally distributed. Fixed acidity and volatile acidity appear to be somewhat bimodal, citric.acid and free.sulfur.dioxide appear to have a plateau distribution, and chlorides and total.sulfur.dioxide are right-skewed, with a long tail toward high values (their means sit above their medians in the summary). While we could play around with decreasing or increasing bin sizes until the histograms look more normal, this would distort our view of the data and is not something we want to do in this exploratory phase.
Although wine quality has a discrete range of only 3-8, we can see that its distribution is roughly bell-shaped. A large majority of the wines examined received ratings of 5 or 6, and very few received 3, 4, or 8.
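To make the outlier check less dependent on eyeballing a histogram, here is a minimal base R sketch. It uses only the first ten residual.sugar values printed by str(WineData), so it illustrates the technique rather than reproducing the full plots; the variable names residual_sugar and outliers are my own.

```r
# First ten residual.sugar values, copied from the str(WineData) output above.
# This is a small illustrative sample, not the full data set.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Points lying more than 1.5 * IQR beyond the hinges, which is the same
# rule a boxplot uses to decide which points to draw beyond its whiskers.
outliers <- boxplot.stats(residual_sugar)$out
print(outliers)

# Histogram and horizontal boxplot stacked, so the long tail in the
# histogram lines up with the points flagged by the boxplot.
op <- par(mfrow = c(2, 1))
hist(residual_sugar, breaks = 10, main = "residual.sugar", xlab = "residual.sugar")
boxplot(residual_sugar, horizontal = TRUE, xlab = "residual.sugar")
par(op)
```

Base R's boxplot() accepts horizontal = TRUE directly, which makes it a convenient way to stack a boxplot under a histogram.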
## 3 4 5 6 7 8
##
## 10 53 681 638 199 18
Given the ratings and distribution of wine quality, I’ll instantiate another categorical variable, classifying the wines as ‘poor’ (rating 0 to 4), ‘average’ (rating 5 or 6), and ‘good’ (rating 7 to 10).
## poor average good
## 63 1319 217
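The binning above can be sketched with cut(). This is one way to do it, not necessarily the exact call used for the output above; the quality vector is rebuilt from the counts in the frequency table rather than read from the data file.

```r
# Rebuild a quality vector from the frequency table above
# (qualities 3..8 with counts 10, 53, 681, 638, 199, 18).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bin into an ordered factor: [0, 4] -> poor, (4, 6] -> average, (6, 10] -> good.
# include.lowest = TRUE keeps a score of exactly 0 inside the first bin.
rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("poor", "average", "good"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

print(table(rating))
```

Using an ordered factor means comparisons such as rating > "poor" are meaningful and plots keep the levels in a sensible order.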
When plotted on a base 10 logarithmic scale, fixed.acidity and volatile.acidity appear to be normally distributed. This makes sense, considering that pH is normally distributed and that pH, by definition, is a measure of acidity on a logarithmic scale. However, citric.acid did not appear to be normally distributed on a logarithmic scale. Upon further investigation:
## [1] 132
The initial plot for citric.acid appears to have a large number of observations with a value of zero. To be more precise, let’s get an exact count: 132 observations have a value of zero. This raises the concern of whether these 132 values were actually measured or simply not reported, considering that the next ‘bin’ higher contains only 32 observations.
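Counting the zeros is a one-liner, and the same check shows why they cause trouble on a logarithmic scale. The sketch below uses only the first ten citric.acid values from the str(WineData) output, so its count is for that sample and not the 132 reported for the full data; the variable names are my own.

```r
# First ten citric.acid values, copied from the str(WineData) output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Number of exact zeros in the sample.
n_zero <- sum(citric_acid == 0)

# log10(0) is -Inf, so every zero falls off a log-scaled axis.
log_citric <- log10(citric_acid)
n_dropped <- sum(is.infinite(log_citric))

print(c(zeros = n_zero, dropped_on_log_scale = n_dropped))
```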
Given that the number of factors is relatively small, examining all of them is not out of the question in exploring their relationship to rating. Doing so will help to narrow down which factors impact rating.
I instantiated an ordered factor, rating, classifying each wine sample as ‘poor’, ‘average’, or ‘good’.
Upon further examination of the data set documentation, it appears that fixed.acidity and volatile.acidity are different types of acid: tartaric acid and acetic acid, respectively. I decided to create a combined variable, TAC.acidity, containing the sum of tartaric, acetic, and citric acid.
Combining tartaric, acetic, and citric acid together to create TAC.acidity provides us with a somewhat normal distribution that skews slightly to the left.
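As a minimal sketch, the combined variable is just a row-wise sum of the three acid columns. The data frame below, named wine here for illustration, holds only the first ten rows shown in the str(WineData) output.

```r
# First ten rows of the three acid columns, copied from str(WineData) above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Sum of tartaric (fixed), acetic (volatile), and citric acid per wine.
wine$TAC.acidity <- with(wine, fixed.acidity + volatile.acidity + citric.acid)

print(wine$TAC.acidity)
```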
We examined the distributions above using boxplots, but bivariate boxplots, with rating or quality on the x-axis, will be more interesting in showing trends with wine quality.
To get a quick snapshot of how the variables affect quality, I generated box plots for each.
From exploring these plots, it seems that a ‘good’ wine generally has these trends:
- higher fixed (tartaric) acidity and citric acid, and lower volatile (acetic) acidity
- lower pH
- higher sulphates
- higher alcohol content
Residual sugar and sulfur dioxides did not seem to have a dramatic impact on the quality or rating of the wines. Interestingly, it appears that different types of acid affect wine quality differently; as such, TAC.acidity saw an attenuated trend, as the presence of volatile (acetic) acid accompanied decreased quality.
By utilizing cor.test, I calculated the correlation for each of these variables against quality:
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## TAC.acidity log10.residual.sugar log10.chlordies
## 0.10375373 0.02353331 -0.17613996
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH log10.sulphates alcohol
## -0.05773139 0.30864193 0.47616632
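The table above can be produced by running cor.test once per variable and keeping only the estimate. The sketch below is one way to do that, assuming quality is converted to numeric first; the helper cor_with_quality is a hypothetical name, and the ten rows come from the str(WineData) output, so the coefficients it prints will not match the full-data values above.

```r
# First ten rows of a few columns, copied from str(WineData) above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical helper: Pearson correlation of one variable against quality.
# as.numeric() is an assumption for the case where quality is stored as a factor.
cor_with_quality <- function(x, quality) {
  unname(cor.test(x, as.numeric(quality))$estimate)
}

correlations <- c(
  volatile.acidity = cor_with_quality(wine$volatile.acidity, wine$quality),
  log10.sulphates  = cor_with_quality(log10(wine$sulphates), wine$quality),
  alcohol          = cor_with_quality(wine$alcohol, wine$quality)
)

print(round(correlations, 3))
```

Because Pearson correlation is unchanged by shifting a variable, converting an ordered factor with consecutive levels to its integer codes gives the same coefficient as using the original scores.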
Quantitatively, it appears that the following variables have relatively higher correlations to wine quality:
- alcohol (0.476)
- volatile.acidity (-0.391)
- log10.sulphates (0.309)
- citric.acid (0.226)
Let’s see what other correlations may exist through a scatterplot matrix.
This scatterplot matrix helps us to see that fixed.acidity and citric acid have a positive linear relationship. There may also be a positive relationship between citric acid and log10(sulphates).
Let’s create a visualization of the correlation coefficients to help us see these correlations, in addition to the scatterplot matrix, for all variables.
## Warning in ggcorr(WineData): data in column(s) 'quality', 'rating' are not
## numeric and were ignored
While looking at a scatterplot of the variables above can give us an indication of correlation, this plot helps us confirm what our eyes see by showing the correlation coefficients as a matrix. Here we can see strong correlations between fixed acidity and TAC.acidity, density, citric acid, and volatile acidity. We also see correlations involving pH, density, and total sulfur dioxide. This helps to confirm what we suspected earlier about some variables possibly being dependent on, or subsets of, each other.
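The warning above arises because non-numeric columns cannot enter a correlation matrix. A minimal base R sketch of the same idea, selecting the numeric columns explicitly before calling cor(), is shown below. It uses ten rows from the str(WineData) output, so the coefficients are illustrative only.

```r
# First ten rows of a few columns from str(WineData), plus the derived rating.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  rating        = factor(c(rep("average", 7), "good", "good", "average"))
)

# Keep only numeric columns; factors such as rating are dropped here
# explicitly instead of being ignored with a warning.
numeric_cols <- wine[sapply(wine, is.numeric)]

# Pairwise Pearson correlation matrix.
cor_matrix <- cor(numeric_cols)
print(round(cor_matrix, 2))
```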
Examining the acidity variables, I saw strong correlations between them:
##
## Pearson's product-moment correlation
##
## data: WineData$fixed.acidity and WineData$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
##
## Pearson's product-moment correlation
##
## data: WineData$volatile.acidity and WineData$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
##
## Pearson's product-moment correlation
##
## data: log10(WineData$TAC.acidity) and WineData$pH
## t = -39.663, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7283140 -0.6788653
## sample estimates:
## cor
## -0.7044435
Most notably, the base 10 logarithm of TAC.acidity correlated strongly with pH (r = -0.70). This is certainly expected, as pH is essentially a measure of acidity. An interesting question to pose, using basic chemistry knowledge, is what components other than the measured acids are affecting pH. We can quantify this difference by building a linear model that predicts pH from TAC.acidity and capturing the % difference between predicted and observed pH as a new variable.
## No summary function supplied, defaulting to `mean_se()
The median % error hovered at or near zero for most wine qualities. Notably, wines rated with a quality of 3 had a large negative error. We can interpret this finding by saying that for many of the ‘bad’ wines, total acidity from tartaric, acetic, and citric acids was a worse predictor of pH. Simply put, it is likely that there were other components, possibly impurities, that affected the pH.
As noted previously, I hypothesized that free.sulfur.dioxide and total.sulfur.dioxide were dependent on each other. Plotting this:
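The pH model described above can be sketched as follows. This is an illustration under assumptions: the column names pH.predicted and pH.error are hypothetical, the sign convention (predicted minus observed) is my own choice, and the ten rows come from the str(WineData) output, so the fitted coefficients will differ from a fit on the full data.

```r
# First ten rows: TAC.acidity is the sum of the three acid columns
# shown in str(WineData); pH and quality are copied from the same output.
wine <- data.frame(
  TAC.acidity = c(8.1, 8.68, 8.6, 12.04, 8.1, 8.06, 8.56, 7.95, 8.4, 8.36),
  pH          = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Linear model of pH on the base 10 logarithm of total measured acidity.
model <- lm(pH ~ log10(TAC.acidity), data = wine)

# Predicted pH and percentage error relative to the observed pH.
# Assumed convention: negative error means the model under-predicts pH.
wine$pH.predicted <- predict(model, newdata = wine)
wine$pH.error <- 100 * (wine$pH.predicted - wine$pH) / wine$pH

# Median % error within each quality score.
print(tapply(wine$pH.error, wine$quality, median))
```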
##
## Pearson's product-moment correlation
##
## data: WineData$free.sulfur.dioxide and WineData$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
It is clear that there is a fairly strong positive relationship between the two (r ≈ 0.67), comparable to the strongest relationships seen among the acidity variables. Additionally, beyond what the variable names already suggest, the clear ‘floor’ on this graph hints that free.sulfur.dioxide is a subset of total.sulfur.dioxide.
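The subset hypothesis can also be checked directly: if free sulfur dioxide is part of total sulfur dioxide, it should never exceed it. A minimal sketch on the first ten rows from the str(WineData) output follows; the variable names are my own, and a check on ten rows is of course not a check on the full data.

```r
# First ten rows of both sulfur dioxide columns, copied from str(WineData).
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# If free is a subset of total, free <= total must hold for every wine.
is_subset <- all(free_so2 <= total_so2)

# Share of total sulfur dioxide that is free, per wine.
free_share <- free_so2 / total_so2

print(is_subset)
print(round(free_share, 2))
```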
Let’s see how these variables compare, plotted against each other and faceted by wine rating to have a better look at the distribution from the scatterplot matrix:
The relative value of these scatterplots is suspect; if anything, they illustrate how heavily alcohol content affects rating. The weakest bivariate relationship appeared to be alcohol vs. citric acid, where the points were nearly uniformly distributed. The strongest relationship appeared to be volatile acidity vs. citric acid, which had a negative correlation.
Volatile acidity and citric acid have a negative relationship in each category of wine. Log10(sulphates) and alcohol appear to have a positive relationship for average wines; for poor wines the relationship appears to be negative, and for good wines it is positive up to a point. Alcohol and pH show a strong relationship in each category of wine, which may indicate that one affects the other.
I primarily examined the 4 features which showed high correlation with quality. These scatterplots were a bit crowded, so I faceted by rating to illustrate the population differences between good wines, average wines, and poor wines. It’s clear that higher citric acid and lower volatile (acetic) acid contribute towards better wines. Likewise, better wines tended to have higher sulphates and alcohol content. Surprisingly, pH had very little visual impact on wine quality, and was overshadowed by the larger impact of alcohol. Interestingly, this shows that what makes a good wine depends on the type of acids that are present.
These subplots were created to demonstrate the effect of acidity and pH on wine quality. Generally, higher acidity (or lower pH) is seen in highly rated wines. As a caveat, the presence of volatile (acetic) acid negatively affected wine quality. Citric acid had a relatively high correlation with wine quality, while fixed (tartaric) acid had a smaller impact.
These boxplots demonstrate the effect of alcohol content on wine quality. Generally, higher alcohol content correlated with higher wine quality. However, as the outliers and intervals show, alcohol content alone did not produce a higher quality.
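A numeric companion to these boxplots is the median alcohol content within each quality score. The sketch below uses base R and only the first ten rows from the str(WineData) output, so it covers qualities 5 to 7 only and is meant to show the mechanics and not the full trend.

```r
# First ten rows of alcohol and quality, copied from str(WineData) above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol content for each quality score present in the sample.
median_alcohol <- tapply(alcohol, quality, median)
print(median_alcohol)

# Base R boxplot of alcohol grouped by quality.
boxplot(alcohol ~ quality, xlab = "quality", ylab = "alcohol")
```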
## `geom_smooth()` using method = 'loess'
This is perhaps the most telling graph. I subsetted the data to remove the ‘average’ wines, that is, any wine with a quality of 5 or 6. As the correlation tests show, alcohol and log10.sulphates were the two strongest positive correlates of wine quality. While the boundaries are not clear-cut, it’s apparent that low log10.sulphates, with few exceptions, kept wine quality down. A combination of high alcohol content and high log10.sulphates produced better wines.
This exploratory analysis suggests that certain factors drive wine quality, mainly alcohol content, sulphates, and acidity. Something to keep in mind about how this data was collected is that it used human ratings, which can be extremely subjective. That said, the correlations for these variables are within reasonable bounds. The graphs adequately illustrate the factors that make good wines ‘good’ and poor wines ‘poor’. Further study with inferential statistics (t-test, ANOVA, etc.) could be done to quantitatively confirm these assertions. I also believe a deeper understanding of the field would add great insight to these findings. Having a sense of the context of the data and the field to which it belongs can be a helpful guide in such an analysis.
Not having a deep knowledge of wine, of the variables that go into the winemaking process, or of the growing of the grapes led to a sense of “feeling around in the dark,” so to speak. From my novice understanding of wine, I’ve heard anecdotally that the age of a wine can also impact its quality. I’m sure other variables such as region and grape type could also affect perceived quality. Gaining additional knowledge and variables could add further context to such an analysis.
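Removing the ‘average’ wines is a simple filter on the rating factor. Here is a minimal sketch, assuming rating is built with cut() as an ordered factor; the data frame name wine, the name extremes, and the ten sample rows (from the str(WineData) output) are for illustration only.

```r
# First ten rows of the relevant columns, copied from str(WineData) above.
wine <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Derive the three-level rating from quality (assumed bin edges: 0-4, 5-6, 7-10).
wine$rating <- cut(
  wine$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("poor", "average", "good"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

# Keep only the poor and good wines, then drop the now-empty factor levels
# so that plots and tables do not show an empty 'average' group.
extremes <- droplevels(subset(wine, rating != "average"))
extremes$log10.sulphates <- log10(extremes$sulphates)

print(extremes)
```

In this ten-row sample only two ‘good’ wines remain; on the full data the same filter would keep the 63 poor and 217 good wines counted earlier.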